feature importance
MinShap: A Modified Shapley Value Approach for Feature Selection
Zheng, Chenghui, Raskutti, Garvesh
Feature selection is a classical problem in statistics and machine learning, and it remains extremely challenging, especially in the context of unknown non-linear relationships with dependent features. Shapley values, on the other hand, are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly dependent features. However, Shapley values are not naturally suited for feature selection, since they tend to capture both the direct effect of each feature on the response and indirect effects through other features. In this paper, we adapt Shapley values to feature selection by proposing MinShap, a modification of the Shapley value framework, along with a suite of related algorithms. In particular, instead of averaging the marginal contributions over permutations of features, MinShap considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in directed acyclic graphical (DAG) models and a guarantee on the Type I error of MinShap, and we show through numerical simulations and real data experiments that MinShap tends to outperform state-of-the-art feature selection algorithms such as LOCO, GCM and Lasso in terms of both accuracy and stability. We also introduce a suite of algorithms related to MinShap that take a multiple testing/p-value perspective to improve performance in lower-sample settings, and we provide supporting theoretical guarantees.
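The difference between averaging and taking the minimum of marginal contributions can be illustrated with a small sketch. It is not the authors' implementation: the value function `value` is a hypothetical toy game in which feature 0 has a direct effect, feature 1 is only a proxy for feature 0, and feature 2 is irrelevant. The paper's estimator and its testing procedure are not reproduced here.

```python
from itertools import permutations

def value(subset):
    """Hypothetical value function (an assumption for illustration):
    the predictive value of a subset of features."""
    if 0 in subset:      # feature 0 carries the direct effect
        return 10
    if 1 in subset:      # feature 1 only helps as a proxy when 0 is absent
        return 8
    return 0             # feature 2 never helps

def marginal_contributions(n_features, value_fn):
    """Collect v(S + {j}) - v(S) for every feature j over all permutations."""
    contribs = {j: [] for j in range(n_features)}
    for order in permutations(range(n_features)):
        seen = set()
        for j in order:
            before = value_fn(frozenset(seen))
            seen.add(j)
            contribs[j].append(value_fn(frozenset(seen)) - before)
    return contribs

contribs = marginal_contributions(3, value)
# Shapley-style score: average marginal contribution over permutations
shapley = {j: sum(c) / len(c) for j, c in contribs.items()}
# MinShap-style score: minimum marginal contribution over permutations
minshap = {j: min(c) for j, c in contribs.items()}

print(shapley)  # {0: 6.0, 1: 4.0, 2: 0.0}
print(minshap)  # {0: 2, 1: 0, 2: 0}
```

In this toy game the averaged score credits the proxy feature 1 with a nonzero value, while the minimum drops it to zero because feature 1 adds nothing once feature 0 is already included. Enumerating all permutations is only feasible for a handful of features.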
Regional Explanations: Bridging Local and Global Variable Importance
Amoukou, Salim I., Brunel, Nicolas J-B.
We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance's feature contributions from its regional membership. This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.
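The regional idea described above can be sketched as follows, under strong simplifying assumptions. The split in `region_of` is a fixed, hypothetical partition (the paper's way of constructing regions is not reproduced), the "model" is a toy group-mean predictor, and the error is measured in-sample. A realistic LOCO would refit a proper model without each feature and evaluate on held-out data.

```python
from collections import defaultdict

def group_mean_mse(rows, targets, keep):
    """In-sample MSE of a toy predictor: the mean target for each
    distinct combination of the kept feature values."""
    groups = defaultdict(list)
    for x, y in zip(rows, targets):
        groups[tuple(x[k] for k in keep)].append(y)
    means = {key: sum(ys) / len(ys) for key, ys in groups.items()}
    sq_err = [(y - means[tuple(x[k] for k in keep)]) ** 2
              for x, y in zip(rows, targets)]
    return sum(sq_err) / len(sq_err)

def loco(rows, targets, n_features):
    """Leave-one-covariate-out: error increase when feature j is dropped."""
    full = group_mean_mse(rows, targets, list(range(n_features)))
    return [
        group_mean_mse(rows, targets,
                       [k for k in range(n_features) if k != j]) - full
        for j in range(n_features)
    ]

def region_of(x):
    """Hypothetical fixed split on the sign of the first feature."""
    return "A" if x[0] < 0 else "B"

# Toy data: the response depends on x1 in region A and on x0 in region B
rows = [(x0, x1) for x0 in (-2, -1, 1, 2) for x1 in (-1, 0, 1)]
targets = [3 * x1 if x0 < 0 else 2 * x0 for x0, x1 in rows]

# Apply the global method (LOCO) separately inside each region
by_region = defaultdict(lambda: ([], []))
for x, y in zip(rows, targets):
    xs, ys = by_region[region_of(x)]
    xs.append(x)
    ys.append(y)
regional_importance = {r: loco(xs, ys, 2) for r, (xs, ys) in by_region.items()}

def explain(x):
    """An instance inherits the importances of the region it falls in."""
    return regional_importance[region_of(x)]

print(explain((-1, 1)))  # [0.0, 6.0]
print(explain((2, 0)))   # [1.0, 0.0]
```

Within region A only the second feature receives importance, and within region B only the first, whereas a single global LOCO over all rows would assign nonzero importance to both features for every instance.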
RFX-Fuse: Breiman and Cutler's Unified ML Engine + Native Explainable Similarity
Breiman and Cutler's original Random Forest was designed as a unified ML engine -- not merely an ensemble predictor. Their implementation included classification, regression, unsupervised learning, proximity-based similarity, outlier detection, missing value imputation, and visualization -- capabilities that modern libraries like scikit-learn never implemented. RFX-Fuse (Random Forests X [X=compression] -- Forest Unified Learning and Similarity Engine) delivers Breiman and Cutler's complete vision with native GPU/CPU support. Modern ML pipelines require 5+ separate tools -- XGBoost for prediction, FAISS for similarity, SHAP for explanations, Isolation Forest for outliers, custom code for importance. RFX-Fuse replaces these with one or two model objects built from a single set of trees grown once. Novel Contributions: (1) Proximity Importance -- native explainable similarity: proximity measures how similar samples are; proximity importance explains why. (2) Dataset-specific imputation validation for general tabular data -- ranking imputation methods by how real the imputed data looks, without ground truth labels.
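For readers unfamiliar with forest proximity, the classic Breiman-Cutler definition is the fraction of trees in which two samples land in the same leaf. A minimal sketch of that definition follows. It is not RFX-Fuse code and does not cover proximity importance. The `leaf_ids` table is made-up illustrative data, where `leaf_ids[t][i]` is assumed to be the leaf reached by sample `i` in tree `t`.

```python
def proximity_matrix(leaf_ids):
    """Classic forest proximity: share of trees in which samples i and j
    fall into the same leaf."""
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    return [
        [sum(tree[i] == tree[j] for tree in leaf_ids) / n_trees
         for j in range(n_samples)]
        for i in range(n_samples)
    ]

# Made-up leaf assignments for 4 trees and 4 samples (illustration only)
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 2, 3],
    [0, 0, 1, 2],
]
prox = proximity_matrix(leaf_ids)

print(prox[0][1])  # 0.75: samples 0 and 1 share a leaf in 3 of 4 trees
print(prox[2][3])  # 0.5
print(prox[0][3])  # 0.0
```

The full matrix costs memory quadratic in the number of samples, which is one reason practical implementations tend to store it in compressed or sparse form.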